Data communication and data science

GEOG 30323

November 29, 2016

Course recap

  • Thus far: we’ve focused on exploratory data analysis, which involves data wrangling, summarization, and visualization
  • Your data analysis journey shouldn’t stop here! Topics to consider:
    • Explanatory vs. exploratory visualization
    • Statistics and data science
    • Data ethics and “big data” (next week)

Communicating with data

  • Once you’ve done all of the hard work wrangling your data, you’ll want to communicate insights to others!
  • This might include:
    • Polished data products or reports
    • Models that can scale your insights

Explanatory visualization

  • We’ve largely worked to this point with exploratory visualization, which refers to internally-facing visualizations that help us reveal insights about our data
  • Often, externally-facing data products will include explanatory visualization, which features a polished design and emphasizes one or two key points

Interactive reports

  • Example: a data journalism article - or your Jupyter Notebook!
  • Key distinction: your code, data exploration, etc. will likely be external to the report (this can vary depending on the context, however)

Tableau

  • Highly popular software for data visualization - both exploratory and explanatory
  • Intuitive, drag-and-drop interface
  • Key feature: the dashboard

Data dashboards

Demo: Tableau Public

Infographics

Source: Metro.us

Infographics

Obesity infographics:

Are infographics useful?

Data Science

  • Data science: new(ish) field that has emerged to address the challenges of working with modern data
  • Fuses statistics, computer science, visualization, graphic design, and the humanities/social sciences/natural sciences…

The data analysis process

Visualization vs. modeling

Hadley Wickham (paraphrased):

Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.

Statistical modeling

  • What is the mathematical relationship between an outcome variable \(Y\) and one or more other “predictor” variables \(X_{1}...X_{n}\)?
  • Recall our use of lmplot in seaborn - lm stands for linear model

Statistical modeling

The linear model:

\[ Y = Xb + e \]

where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”

  • Linear models will not always be appropriate for modeling relationships between variables!
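The linear model above can be illustrated with plain NumPy: simulate data from known parameters \(b\), then recover them with ordinary least squares. This is a minimal sketch with illustrative variable names and parameter values, not part of the course dataset:

```python
import numpy as np

rng = np.random.RandomState(1983)

# Simulate Y = Xb + e with known parameters b = [2.0, 0.5]
n = 500
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept column + one predictor
b_true = np.array([2.0, 0.5])
e = rng.normal(0, 1, n)  # the errors, or "residuals"
Y = X @ b_true + e

# Ordinary least squares recovers b from Y and X
b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b_hat)  # close to [2.0, 0.5]
```

With enough observations and well-behaved errors, the estimated parameters land close to the true ones; statsmodels (next slides) wraps this machinery in a formula interface.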

Statistics in Python

  • Substantial statistical functionality available in the statsmodels package, which installs with Anaconda

Statistics in Python

Let’s get an example ready:

import pandas as pd
import seaborn as sns
sns.set_context('notebook')
import statsmodels.formula.api as smf

df = pd.read_csv('http://personal.tcu.edu/kylewalker/data/texas_colleges.csv')
df['grad_rate'] = df.grad_rate * 100

Linear regression

f = smf.ols(formula = 'median_earn ~ grad_rate', data = df).fit()

f.summary()
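`summary()` prints a full report, but individual results are also available as attributes of the fitted model. A self-contained sketch with simulated data (the example above depends on the course CSV; the variable names and coefficients here are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the Texas colleges data
rng = np.random.RandomState(42)
sim = pd.DataFrame({'grad_rate': rng.uniform(20, 90, 200)})
sim['median_earn'] = 20000 + 300 * sim.grad_rate + rng.normal(0, 2000, 200)

fit = smf.ols(formula='median_earn ~ grad_rate', data=sim).fit()

fit.params     # intercept and slope estimates
fit.rsquared   # R-squared of the model
fit.pvalues    # p-values for each coefficient
```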

Multiple regression

f2 = smf.ols(formula = 'median_earn ~ grad_rate + sat_avg', data = df).fit()
f2.summary()

Residuals and fitted values

df['fitted'] = f2.predict()
df['resid'] = f2.resid
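By construction, the fitted values and residuals reconstruct the observed outcome: observed = fitted + residual. A quick check on simulated data (a sketch; the variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(0)
sim = pd.DataFrame({'x': rng.normal(size=100)})
sim['y'] = 1 + 2 * sim.x + rng.normal(size=100)

fit = smf.ols('y ~ x', data=sim).fit()

# Fitted values plus residuals recover the observed y (up to floating point)
check = fit.predict() + fit.resid
print(np.allclose(check, sim.y))  # True
```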

Residuals and fitted values

import cufflinks as cf
cf.go_offline()

df.iplot(x = 'fitted', y = 'resid', kind = 'scatter', mode = 'markers', 
        text = 'instnm', zerolinecolor = 'red', color = 'blue', 
        xTitle = 'Fitted values', yTitle = 'Residuals')

Residuals and fitted values

Machine learning

  • “The science of getting computers to act without being explicitly programmed”
  • Types of machine learning algorithms: supervised and unsupervised
  • Topics in machine learning: classification, clustering, regression

Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

In Python: scikit-learn

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv('http://personal.tcu.edu/kylewalker/data/dec8.csv', index_col = 'name')

df.head()

Example: K-means clustering

np.random.seed(1983)

km = KMeans(n_clusters = 7).fit(df)

df['clusters'] = km.labels_

# Check TCU's cluster (.ix is deprecated in pandas; use .loc)
df.loc['Texas Christian University']
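The course dataset lives at an external URL, so here is a self-contained sketch of the same pattern on simulated data (the blob locations, cluster count, and seed are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.RandomState(1983)

# Two well-separated blobs of points
a = rng.normal(loc=0, scale=0.5, size=(50, 2))
b = rng.normal(loc=10, scale=0.5, size=(50, 2))
sim = pd.DataFrame(np.vstack([a, b]), columns=['x', 'y'])

# Fit K-means and attach the cluster labels, as in the slides
km = KMeans(n_clusters=2, n_init=10, random_state=1983).fit(sim)
sim['clusters'] = km.labels_

# Each blob should land in its own cluster
print(sim.clusters.iloc[:50].nunique(), sim.clusters.iloc[50:].nunique())  # 1 1
```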

Example: K-means clustering

from ipywidgets import interact

def glimpse_clusters(cluster_id):
    sub = df[df.clusters == cluster_id]
    print(sub.head(20))
    
interact(glimpse_clusters, cluster_id = (0, 6))

Example: nearest-neighbor search

neigh = NearestNeighbors(n_neighbors = 5)

# "Training" the model
neigh.fit(df) 

# Searching for neighbors
model = neigh.kneighbors(df, return_distance = False)
results = pd.DataFrame(model, columns = ['x1', 'x2', 'x3', 'x4', 'x5'])
merged = pd.merge(df.reset_index(), results, right_index = True, left_index = True)

Example: nearest-neighbor search

def find_neighbors(university): 
    d = merged[merged.name == university].reset_index()
    for x in ['x2', 'x3', 'x4', 'x5']: 
        idx = d.iloc[0][x]
        m = merged.loc[idx]
        print(m['name'])

interact(find_neighbors, university = 'Texas Christian University')
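As with K-means above, the nearest-neighbor pattern can be sketched on simulated data. Note that each point's first neighbor is itself (distance zero), which is why the lookup above skips column x1 (a hedged sketch; sizes and seed are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1983)
pts = rng.uniform(0, 100, size=(30, 3))  # 30 observations, 3 features

# "Train" the model, then search for the 5 nearest neighbors of each point
neigh = NearestNeighbors(n_neighbors=5).fit(pts)
idx = neigh.kneighbors(pts, return_distance=False)

# The first neighbor of every point is the point itself
print((idx[:, 0] == np.arange(30)).all())  # True
```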

Making predictions


How to learn more